library(tidyverse)
library(knitr)
library(janitor)
library("readxl")
library(ggfortify)
library(GGally)
library(qtlcharts)
library(leaps)
library(sjPlot)
library(pheatmap)
Excess weight, especially obesity, has become an epidemic in the 21st century and resulted in many significant health and economic consequences for the global population (Stein and Colditz, 2004). In Australia, this epidemic has also spread as 2 in 3 adults are classified as overweight or obese (Australian Institute of Health and Welfare). Researches has shown that this epidemic is more common in males than females and hence, BYU Human Performance Research Center has collected data from 250 men of various age and obtained estimates of the percentage of body fat through underwater weighing and various body circumference measurements (Rahman and Harding, 2013; DASL, n.d.). As body fat percentage is difficult to calculate in real life, the value for body fat percentage was derived from body density using the Siri’s 1956 equation.
There were not many information given with regards to the sampling method. However, from looking at the dataset, there is gender biase as the epidemic question is one related to both gender, yet only male were involved in the sample. This suggests that any analysis based on this dataset cannot be applied to the whole population but only the male population.
data = read.delim("bodyfat.txt") %>% janitor::clean_names()
data = data %>%
mutate(bmi = (data$weight/(data$height ^ 2)) * 703,
overweight = case_when(
bmi >= 25 ~ 1,
bmi < 25 ~ 0))
#colnames(data)
data_bmi = data[-c(1:2,4:5,18)]
data_bf = data[-c(1,3:5,17:18)]
data_density = data[-c(2:5,17:18)]
<<<<<<< HEAD
#glimpse(data)
=======
colnames(data_density)
## [1] "density" "neck" "chest" "abdomen" "waist" "hip" "thigh"
## [8] "knee" "ankle" "bicep" "forearm" "wrist"
#glimpse(data)
This report aims to determine an alternative method to determine “over-weight” individuals oppose to body fat percentage and the two alternative methods considered are:
The analysis will first begin through determining how much variation in body fat percentage can be explained by simply body measurements and the number of measurements that is significant in building an accurate prediction model to examine the ease of calculation.
Then similar analysis will be conducted on BMI and Body Density where the end result will be compared together to determine which method can be explained the best using simple body measurements and offers easier and simpler interpretation.
Lastly, a binary indicator will be added to differentiate the sample into over-weight individuals (Yes) and non-over-weight individuals (No) using body-fat percentage as the guiding criteria. A logstic regression is run on the binary indicator with BMI, body density and age for a simpler model to determine the odds of an individual being obesed.
>>>>>>> 1aef8f86c451aee3917d7dfa1dbcbb6d6defb7c4Due to the increasing consumptions of fast food and the increasing convenience of food deliveries, concerns about obesity level is rising throughput the world and has reached a new high. This increasing concern has lead to an increasing need to measure obesity accurately and percentage body fat is arguably the most accurate measure by far. However, the calculation of body fat is difficult and many has switched to Body Mass Index (BMI) for simpler calculation. This section is looking at comparing the results from predicting body fat percentage using other body measurements and predicting BMI using other body measurements to determine wh body measurement is the most important in determining obesity.
Due to the increasing consumptions of fast food and the increasing convenience of food deliveries, concerns about obesity level is rising throughput the world and has reached a new high. This increasing concern has lead to an increasing need to measure obesity accurately and percentage body fat is arguably the most accurate measure by far. However, the calculation of body fat is difficult and many has switched to Body Mass Index (BMI) for simpler calculation. This section is looking at how much variation that simple body measurements can explain in the three methods of interest - body fat percentage, BMI and body density.
\[ Pcf.BF = \beta_0 + \beta_1Weight + \beta_2Height + \beta_3Neck + \beta_4Chest \\ + \beta_5Abdomen + \beta_6Waist + \beta_7Hip + \beta_8Thigh + \beta_9Knee + \beta_{10}Ankle \\+ \beta_{11}Bicep + \beta_{12}Forearm + \beta_{13}Wrist + \epsilon \]
qtlcharts::iplotCorr(data_bf)
## Set screen size to height=700 x width=1000
Based on the interactive correlation matrix above, it can be seen the level of correlation differs quite drastically between the variables and the backward variable selection method is adopted.
Due to the increasing consumptions of fast food and the increasing convenience of food deliveries, concerns about obesity level is rising throughput the world and has reached a new high. This increasing concern has lead to an increasing need to measure obesity accurately and percentage body fat is arguably the most accurate measure by far. However, the calculation of body fat is difficult and many has switched to Body Mass Index (BMI) for simpler calculation. This section is looking at comparing the results from predicting body fat percentage using other body measurements and predicting BMI using other body measurements to determine wh body measurement is the most important in determining obesity.
qtlcharts::iplotCorr(data_bf)
## Set screen size to height=700 x width=1000
<<<<<<< HEAD
Based on the interactive correlation matrix, it can be seen the level of correlation differs quite drastically between the variables and the backward variable selection method is adopted.
bf_lm = lm(pct_bf~.,data=data_bf)
summary(bf_lm)
##
## Call:
## lm(formula = pct_bf ~ ., data = data_bf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8684 -2.9088 -0.1904 3.0491 11.1421
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.20340 6.83392 0.322 0.74742
## neck -0.45612 0.23034 -1.980 0.04882 *
## chest -0.13005 0.09197 -1.414 0.15866
## abdomen 1.03299 0.07638 13.524 < 2e-16 ***
## waist NA NA NA NA
## hip -0.33000 0.12768 -2.585 0.01034 *
## thigh 0.08793 0.13395 0.656 0.51217
## knee -0.13537 0.22744 -0.595 0.55227
## ankle 0.05505 0.21751 0.253 0.80041
## bicep 0.17762 0.17029 1.043 0.29798
## forearm 0.19468 0.20718 0.940 0.34834
## wrist -1.52499 0.50529 -3.018 0.00282 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.341 on 239 degrees of freedom
## Multiple R-squared: 0.737, Adjusted R-squared: 0.726
## F-statistic: 66.98 on 10 and 239 DF, p-value: < 2.2e-16
Using the individual p-value method, the varaibles that need to be dropped are chest, waist, thigh, knee,ankle, bicep, forearm with ankle being the first to drop down due to its high p-value. However, to double check, the AIC criterion will also be considered.
bf_step_back = step(bf_lm, direction = "backward",trace = FALSE)
summary(bf_step_back)
##
## Call:
## lm(formula = pct_bf ~ neck + chest + abdomen + hip + bicep +
## wrist, data = data_bf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.668 -2.889 -0.361 3.210 11.148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.52703 6.63727 0.230 0.818232
## neck -0.39650 0.22234 -1.783 0.075783 .
## chest -0.12810 0.08992 -1.425 0.155562
## abdomen 1.01805 0.07431 13.700 < 2e-16 ***
## hip -0.28758 0.09232 -3.115 0.002060 **
## bicep 0.26094 0.15160 1.721 0.086469 .
## wrist -1.55084 0.45510 -3.408 0.000767 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.32 on 243 degrees of freedom
## Multiple R-squared: 0.7353, Adjusted R-squared: 0.7287
## F-statistic: 112.5 on 6 and 243 DF, p-value: < 2.2e-16
Based on the backward selection model, the fitted model has become:
<<<<<<< HEAD\[ \hat{Body - Fat} = 1.52 -0.3965Neck - 0.128Chest + 1.01805Abdomen -0.28758Hip + 0.26Vicep -1.55084Wrist \]
Finally, to check assumption, we perform the ggfortify function.
par(mfrow=c(1,2))
plot(bf_step_back,which=1:2) + theme_bw()
## NULL
The QQ plot shows a straight line which indicates that the normality assumption is reasonable. However, the residuals vs fitted plot shows a slight variation; but given that body fat is hard to predict, this is acceptable.
relbf <- function(fit,...){
R <- cor(fit$model)
nvar <- ncol(R)
rxx <- R[2:nvar, 2:nvar]
rxy <- R[2:nvar, 1]
svd <- eigen(rxx)
evec <- svd$vectors
ev <- svd$values
delta <- diag(sqrt(ev))
lambda <- evec %*% delta %*% t(evec)
lambdasq <- lambda ^ 2
beta <- solve(lambda) %*% rxy
rsquare <- colSums(beta ^ 2)
rawwgt <- lambdasq %*% beta ^ 2
import <- (rawwgt / rsquare) * 100
import <- as.data.frame(import)
row.names(import) <- names(fit$model[2:nvar])
names(import) <- "Weights"
import <- import[order(import),1, drop=FALSE]
dotchart(import$Weights, labels=row.names(import),
xlab="% of R-Square", pch=19,
main="Relative Importance of Predictor Variables",
sub=paste("Total R-Square=", round(rsquare, digits=3)),
...)
return(import)
}
relbf(bf_step_back, col="blue")
## Weights
## wrist 4.038431
## bicep 7.746760
## neck 8.238378
## hip 16.313788
## chest 21.795458
## abdomen 41.867184
The final model is:
\[ \hat{body-fat} = 1.52 -0.3965neck - 0.128chest + 1.01805abdomen -0.28758hip + 0.26bicep -1.55084wrist \]
and abdomen is relatively the most important predictor for predicting body fat percentage.
Looking at the \(R^2\) value (multiple R-squared) from the summary output, 73.5% of the variability of density is explained by the regression on percentage of Height, Neck, Chest, Abdomen.
For this analysis, the formula of BMI is
\[ BMI = \frac{Weight (lbs)*703}{Height(in)^2} \]
\[ BMI = \beta_0 + \beta_1Neck + \beta_2Chest \\ + \beta_3Abdomen + \beta_4Waist + \beta_5Hip + \beta_6Thigh + \beta_7Knee + \beta_8Ankle \\+ \beta_9Bicep + \beta_{10}Forearm + \beta_{11}Wrist + \epsilon \]
qtlcharts::iplotCorr(data_bmi)
Based on the interactive correlation matrix, it can be seen the level of correlation differs quite drastically between the variables and the backward variable selection method is also adopted here.
$ = 1.52 -0.3965neck - 0.128chest + 1.01805abdomen -0.28758hip + 0.26bicep -1.55084wrist $
Finally, to check assumption, we perform the ggfortify function.
par(mfrow=c(1,2))
plot(bf_step_back,which=1:2) + theme_bw()
## NULL
The QQ plot shows a straight line which indicates that the normality assumption is reasonable. However, the residuals vs fitted plot shows a slight variation; but given that body fat is hard to predict, this is acceptable.
$ = 1.52 -0.3965neck - 0.128chest + 1.01805abdomen -0.28758hip + 0.26bicep -1.55084wrist $
For this analysis, the formula of BMI is \(BMI = \frac{Weight (lbs)*703}{Height(in)^2}\)
qtlcharts::iplotCorr(data_bmi)
<<<<<<< HEAD
Based on the interactive correlation matrix, it can be seen the level of correlation differs quite drastically between the variables and the backward variable selection method is adopted.
bmi_lm = lm(bmi~.,data=data_bmi)
summary(bmi_lm)
##
## Call:
## lm(formula = bmi ~ ., data = data_bmi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1790 -0.6548 0.0086 0.6881 3.8335
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.938661 1.725613 -6.919 4.19e-11 ***
## age 0.010049 0.007594 1.323 0.1870
## neck 0.031474 0.056108 0.561 0.5754
## chest 0.149956 0.022420 6.688 1.59e-10 ***
## abdomen 0.118223 0.020898 5.657 4.40e-08 ***
## waist NA NA NA NA
## hip 0.060113 0.032231 1.865 0.0634 .
## thigh 0.151883 0.034888 4.353 1.99e-05 ***
## knee -0.265374 0.056116 -4.729 3.87e-06 ***
## ankle 0.067577 0.053692 1.259 0.2094
## bicep 0.049712 0.041497 1.198 0.2321
## forearm 0.086789 0.051015 1.701 0.0902 .
## wrist -0.045825 0.129082 -0.355 0.7229
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.058 on 238 degrees of freedom
## Multiple R-squared: 0.9046, Adjusted R-squared: 0.9002
## F-statistic: 205.1 on 11 and 238 DF, p-value: < 2.2e-16
Using the individual p-value method, the varaibles that need to be dropped are hip, ankle, bicep, forearm and wrist. To double check, the AIC criterion will also be considered.
bmi_step_back = step(bmi_lm, direction = "backward",trace = FALSE)
summary(bmi_step_back)
##
## Call:
## lm(formula = bmi ~ chest + abdomen + hip + thigh + knee + forearm,
## data = data_bmi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1197 -0.6944 -0.0274 0.6831 3.8464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.94257 1.43829 -7.608 6.10e-13 ***
## chest 0.16090 0.02122 7.582 7.18e-13 ***
## abdomen 0.12726 0.01826 6.968 3.01e-11 ***
## hip 0.05047 0.03084 1.637 0.1030
## thigh 0.14983 0.03032 4.942 1.44e-06 ***
## knee -0.23116 0.05148 -4.490 1.10e-05 ***
## forearm 0.11484 0.04468 2.571 0.0108 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.058 on 243 degrees of freedom
## Multiple R-squared: 0.9024, Adjusted R-squared: 0.9
## F-statistic: 374.6 on 6 and 243 DF, p-value: < 2.2e-16
Based on the backward selection model, the fitted model has become:
<<<<<<< HEAD\[ \hat{BMI} = -10.94 +0.161Chest + 0.127Abdomen + 0.050Hip + 0.150 Thigh - 0.23Knee + 0.115Forearm \]
Finally, to check assumption, we perform the ggfortify function.
par(mfrow=c(1,2))
plot(bmi_step_back,which=1:2) + theme_bw()
$ = -10.94 +0.161chest + 0.127abdomen + 0.050hip + 0.150 thigh - 0.23knee + 0.115forearm $
Finally, to check assumption, we perform the ggfortify function.
par(mfrow=c(1,2))
plot(bmi_step_back,which=1:2) + theme_bw()
## NULL
The QQ plot shows a straight line which indicates that the normality assumption is reasonable. However, the residuals vs fitted plot shows a fan shaped plot which indicates that the assumption of homogeneous variance is violated. We can use a log transformed response and re-fit the linear regression.
The new model will become: $log() = 1.83 +0.0058chest + 0.0052abdomen + 0.0064 thigh -0.0065knee + 0.0028bicep + 0.0040 forearm $.
ln_bmi_lm = lm(log(bmi)~.,data=data_bmi)
summary(ln_bmi_lm)
##
## Call:
## lm(formula = log(bmi) ~ ., data = data_bmi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.133137 -0.024215 -0.000443 0.027919 0.102670
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7512060 0.0654240 26.767 < 2e-16 ***
## age 0.0005158 0.0002879 1.791 0.074515 .
## neck 0.0017502 0.0021272 0.823 0.411459
## chest 0.0055041 0.0008500 6.475 5.37e-10 ***
## abdomen 0.0044728 0.0007923 5.645 4.68e-08 ***
## waist NA NA NA NA
## hip 0.0010150 0.0012220 0.831 0.407026
## thigh 0.0069490 0.0013227 5.253 3.31e-07 ***
## knee -0.0082150 0.0021276 -3.861 0.000145 ***
## ankle 0.0026638 0.0020357 1.309 0.191940
## bicep 0.0024437 0.0015733 1.553 0.121689
## forearm 0.0038084 0.0019341 1.969 0.050112 .
## wrist -0.0017188 0.0048940 -0.351 0.725745
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04009 on 238 degrees of freedom
## Multiple R-squared: 0.9079, Adjusted R-squared: 0.9036
## F-statistic: 213.2 on 11 and 238 DF, p-value: < 2.2e-16
ln_bmi_step_back = step(ln_bmi_lm, direction = "backward",trace = FALSE)
summary(ln_bmi_step_back)
##
## Call:
## lm(formula = log(bmi) ~ age + chest + abdomen + thigh + knee +
## bicep + forearm, data = data_bmi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.132191 -0.023020 -0.000321 0.027716 0.106640
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7989069 0.0524420 34.303 < 2e-16 ***
## age 0.0004265 0.0002629 1.622 0.106083
## chest 0.0058017 0.0008196 7.079 1.57e-11 ***
## abdomen 0.0047155 0.0007078 6.662 1.80e-10 ***
## thigh 0.0075055 0.0011966 6.273 1.62e-09 ***
## knee -0.0069254 0.0018984 -3.648 0.000324 ***
## bicep 0.0026331 0.0015455 1.704 0.089720 .
## forearm 0.0042579 0.0018372 2.318 0.021306 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04002 on 242 degrees of freedom
## Multiple R-squared: 0.9067, Adjusted R-squared: 0.904
## F-statistic: 335.8 on 7 and 242 DF, p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(ln_bmi_step_back,which=1:2) + theme_bw()
<<<<<<< HEAD
## NULL
sjPlot::tab_model(bmi_step_back, ln_bmi_step_back, digits = 5, show.ci = FALSE)
| bmi | log(bmi) | |||
|---|---|---|---|---|
| Predictors | Estimates | p | Estimates | p |
| (Intercept) | -10.94257 | <0.001 | 1.79891 | <0.001 |
| chest | 0.16090 | <0.001 | 0.00580 | <0.001 |
| abdomen | 0.12726 | <0.001 | 0.00472 | <0.001 |
| hip | 0.05047 | 0.103 | ||
| thigh | 0.14983 | <0.001 | 0.00751 | <0.001 |
| knee | -0.23116 | <0.001 | -0.00693 | <0.001 |
| forearm | 0.11484 | 0.011 | 0.00426 | 0.021 |
| age | 0.00043 | 0.106 | ||
| bicep | 0.00263 | 0.090 | ||
| Observations | 250 | 250 | ||
| R2 / R2 adjusted | 0.902 / 0.900 | 0.907 / 0.904 | ||
$log() = 1.83 +0.0058chest + 0.0052abdomen + 0.0064 thigh -0.0065knee + 0.0028bicep + 0.0040 forearm $.
However, although the transformation has aided with the homogeneous variance assumption, the interpretation itself does not make much sense - BMI is determined by an increase in value, not the increase in percentage change. Hence in the final comparison, we will use the untransformed model.
relbmi <- function(fit,...){
R <- cor(fit$model)
nvar <- ncol(R)
rxx <- R[2:nvar, 2:nvar]
rxy <- R[2:nvar, 1]
svd <- eigen(rxx)
evec <- svd$vectors
ev <- svd$values
delta <- diag(sqrt(ev))
lambda <- evec %*% delta %*% t(evec)
lambdasq <- lambda ^ 2
beta <- solve(lambda) %*% rxy
rsquare <- colSums(beta ^ 2)
rawwgt <- lambdasq %*% beta ^ 2
import <- (rawwgt / rsquare) * 100
import <- as.data.frame(import)
row.names(import) <- names(fit$model[2:nvar])
names(import) <- "Weights"
import <- import[order(import),1, drop=FALSE]
dotchart(import$Weights, labels=row.names(import),
xlab="% of R-Square", pch=19,
main="Relative Importance of Predictor Variables",
sub=paste("Total R-Square=", round(rsquare, digits=3)),
...)
return(import)
}
relbmi(bmi_step_back, col="blue")
## Weights
## forearm 9.021850
## knee 9.080285
## thigh 15.075099
## hip 17.245099
## abdomen 24.705834
## chest 24.871833
The final model is:
\[\hat{BMI} = -10.94 +0.161Chest + 0.127Abdomen + 0.050Hip + 0.150 Thigh - 0.23Knee + 0.115Forearm \] and both chest and abdomen are relatively more important in predicting BMI.
Looking at the \(R^2\) value (multiple R-squared) from the summary output, 90.2% of the variability of density is explained by the regression on percentage of Height, Neck, Chest, Abdomen.$log() = 1.83 +0.0058chest + 0.0052abdomen + 0.0064 thigh -0.0065knee + 0.0028bicep + 0.0040 forearm $.
\[ Body Density = \beta_0 + \beta_1Pcf.BF + \beta_2Age + \beta_3Weight + \beta_4Height\\ + \beta_5Neck + \beta_6Chest + \beta_7Abdomen + \beta_8Waist + \beta_9Hip + \beta_{10}Thigh\\ + \beta_{11}Knee + \beta_{12}Ankle + \beta_{13}Bicep + \beta_{14}Forearm + \beta_{15}Wrist + \epsilon \]
#data1<-data_density[,-2]
cor_matrix <- cor(data_density)
pheatmap(cor_matrix, display_numbers = T,na.rm=T)
<<<<<<< HEAD
Above matrix has shown the interactice correlation between variables. Notbaly, Pct.BF has a -0.99 relationship with Density, which means Pct.BF could be used to explain Density. Meanwhile, variables having similar properties are linked together, which could be useful for generating groups.
Above matrix has shown the interactice correlation between variables. Notbaly, Pct.BF has a -0.99 relationship with Density, which means Pct.BF could be used to explain Density. Meanwhile, variables having similar properties are linked together, which could be useful for generating groups.
Above matrix has shown the interactice correlation between variables. Notbaly, Pct.BF has a -0.99 relationship with Density, which means Pct.BF could be used to explain Density. Meanwhile, variables having similar properties are linked together, which could be useful for generating groups.
The residuals \(\epsilon_i\) are iid \(N(0,\sigma^2)\) and there is a linear relationship between y and x.
M0 <- lm(density ~ 1, data = data_density) # Null model
M1 <- lm(density ~ ., data = data_density) # Full model
autoplot(M1,which=1:2)+theme_bw()
<<<<<<< HEAD
round(summary(M1)$coef, 3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.082 0.016 68.068 0.000
## neck 0.001 0.001 2.074 0.039
## chest 0.000 0.000 1.818 0.070
## abdomen -0.002 0.000 -13.645 0.000
## hip 0.001 0.000 2.950 0.003
## thigh 0.000 0.000 -0.980 0.328
## knee 0.000 0.001 0.689 0.492
## ankle 0.000 0.001 -0.675 0.500
## bicep -0.001 0.000 -1.357 0.176
## forearm 0.000 0.000 -0.947 0.345
## wrist 0.004 0.001 3.310 0.001
step.fwd.aic <- step(M0, scope = list(lower = M0, upper = M1), direction = "forward", trace = FALSE)
summary(step.fwd.aic)
##
## Call:
## lm(formula = density ~ waist + wrist + hip + chest + bicep +
## neck, data = data_density)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.024142 -0.007680 0.000523 0.006156 0.038390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0844660 0.0154768 70.070 < 2e-16 ***
## waist -0.0060402 0.0004401 -13.724 < 2e-16 ***
## wrist 0.0038812 0.0010612 3.657 0.000312 ***
## hip 0.0006990 0.0002153 3.247 0.001331 **
## chest 0.0003881 0.0002097 1.851 0.065427 .
## bicep -0.0007779 0.0003535 -2.201 0.028695 *
## neck 0.0009609 0.0005185 1.853 0.065030 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01007 on 243 degrees of freedom
## Multiple R-squared: 0.722, Adjusted R-squared: 0.7152
## F-statistic: 105.2 on 6 and 243 DF, p-value: < 2.2e-16
step.back.aic <- step(M1, scope = list(lower = M0, upper = M1), direction = "backward", trace = FALSE)
summary(step.back.aic)
##
## Call:
## lm(formula = density ~ neck + chest + abdomen + hip + bicep +
## wrist, data = data_density)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.024142 -0.007680 0.000523 0.006156 0.038390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0844660 0.0154768 70.070 < 2e-16 ***
## neck 0.0009609 0.0005185 1.853 0.065030 .
## chest 0.0003881 0.0002097 1.851 0.065428 .
## abdomen -0.0023780 0.0001733 -13.724 < 2e-16 ***
## hip 0.0006990 0.0002153 3.247 0.001331 **
## bicep -0.0007779 0.0003535 -2.201 0.028694 *
## wrist 0.0038812 0.0010612 3.657 0.000312 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01007 on 243 degrees of freedom
## Multiple R-squared: 0.722, Adjusted R-squared: 0.7152
## F-statistic: 105.2 on 6 and 243 DF, p-value: < 2.2e-16
exh <- regsubsets(density~., data = data_density, nvmax = 15)
## Warning in leaps.exhaustive(a, really.big): XHAUST returned error code -999
plot(exh,scale="bic")
<<<<<<< HEAD
M2<- lm(formula = density ~ neck + chest + abdomen,
data = data_density)
summary(M2)
##
## Call:
## lm(formula = density ~ neck + chest + abdomen, data = data_density)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.029395 -0.007156 -0.000682 0.007305 0.046687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1392164 0.0119499 95.333 < 2e-16 ***
## neck 0.0017212 0.0004599 3.743 0.000226 ***
## chest 0.0004569 0.0002135 2.140 0.033361 *
## abdomen -0.0021095 0.0001592 -13.250 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01057 on 246 degrees of freedom
## Multiple R-squared: 0.6904, Adjusted R-squared: 0.6867
## F-statistic: 182.9 on 3 and 246 DF, p-value: < 2.2e-16
M3<- lm(formula = density ~ neck + chest + abdomen + waist ,
data = data_density)
summary(M3)
##
## Call:
## lm(formula = density ~ neck + chest + abdomen + waist, data = data_density)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.029395 -0.007156 -0.000682 0.007305 0.046687
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1392164 0.0119499 95.333 < 2e-16 ***
## neck 0.0017212 0.0004599 3.743 0.000226 ***
## chest 0.0004569 0.0002135 2.140 0.033361 *
## abdomen -0.0021095 0.0001592 -13.250 < 2e-16 ***
## waist NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01057 on 246 degrees of freedom
## Multiple R-squared: 0.6904, Adjusted R-squared: 0.6867
## F-statistic: 182.9 on 3 and 246 DF, p-value: < 2.2e-16
Drop waist and add other variables
M4<- lm(formula = density ~ neck + chest + abdomen + hip ,
data = data_density)
summary(M4)
##
## Call:
## lm(formula = density ~ neck + chest + abdomen + hip, data = data_density)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.028989 -0.007256 0.000047 0.006767 0.045116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1150223 0.0139695 79.818 < 2e-16 ***
## neck 0.0014682 0.0004584 3.203 0.00154 **
## chest 0.0003734 0.0002113 1.768 0.07837 .
## abdomen -0.0023671 0.0001759 -13.455 < 2e-16 ***
## hip 0.0006619 0.0002074 3.191 0.00160 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01037 on 245 degrees of freedom
## Multiple R-squared: 0.7028, Adjusted R-squared: 0.6979
## F-statistic: 144.8 on 4 and 245 DF, p-value: < 2.2e-16
M5<- lm(formula = density ~ neck + chest + abdomen + hip + thigh ,
data = data_density)
summary(M5)
##
## Call:
## lm(formula = density ~ neck + chest + abdomen + hip + thigh,
## data = data_density)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.029545 -0.006988 0.000516 0.007098 0.043367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1073577 0.0144127 76.832 < 2e-16 ***
## neck 0.0016460 0.0004644 3.544 0.000472 ***
## chest 0.0003393 0.0002107 1.610 0.108626
## abdomen -0.0023869 0.0001752 -13.627 < 2e-16 ***
## hip 0.0010639 0.0002889 3.682 0.000285 ***
## thigh -0.0005716 0.0002878 -1.986 0.048171 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01031 on 244 degrees of freedom
## Multiple R-squared: 0.7075, Adjusted R-squared: 0.7015
## F-statistic: 118 on 5 and 244 DF, p-value: < 2.2e-16
Drop chest and add other variables
M6<- lm(formula = density ~ neck + abdomen + hip + thigh + knee ,
data = data_density)
summary(M6)
##
## Call:
## lm(formula = density ~ neck + abdomen + hip + thigh + knee, data = data_density)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.029361 -0.007760 0.000360 0.007152 0.043816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1035095 0.0150104 73.517 < 2e-16 ***
## neck 0.0018149 0.0004395 4.129 5e-05 ***
## abdomen -0.0022098 0.0001347 -16.402 < 2e-16 ***
## hip 0.0010035 0.0002984 3.363 0.000894 ***
## thigh -0.0007038 0.0002938 -2.395 0.017355 *
## knee 0.0007551 0.0005006 1.508 0.132778
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01032 on 244 degrees of freedom
## Multiple R-squared: 0.7071, Adjusted R-squared: 0.7011
## F-statistic: 117.8 on 5 and 244 DF, p-value: < 2.2e-16
Drop knee
M7<- lm(formula = density ~ neck + abdomen + hip + thigh ,
data = data_density)
summary(M7)
##
## Call:
## lm(formula = density ~ neck + abdomen + hip + thigh, data = data_density)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.030204 -0.007301 0.000653 0.007199 0.045107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1104052 0.0143343 77.465 < 2e-16 ***
## neck 0.0019085 0.0004362 4.375 1.8e-05 ***
## abdomen -0.0022064 0.0001351 -16.337 < 2e-16 ***
## hip 0.0011314 0.0002868 3.945 0.000104 ***
## thigh -0.0006094 0.0002878 -2.117 0.035236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01035 on 245 degrees of freedom
## Multiple R-squared: 0.7044, Adjusted R-squared: 0.6996
## F-statistic: 146 on 4 and 245 DF, p-value: < 2.2e-16
<<<<<<< HEAD
=======
relweights <- function(fit,...){
R <- cor(fit$model)
nvar <- ncol(R)
rxx <- R[2:nvar, 2:nvar]
rxy <- R[2:nvar, 1]
svd <- eigen(rxx)
evec <- svd$vectors
ev <- svd$values
delta <- diag(sqrt(ev))
lambda <- evec %*% delta %*% t(evec)
lambdasq <- lambda ^ 2
beta <- solve(lambda) %*% rxy
rsquare <- colSums(beta ^ 2)
rawwgt <- lambdasq %*% beta ^ 2
import <- (rawwgt / rsquare) * 100
import <- as.data.frame(import)
row.names(import) <- names(fit$model[2:nvar])
names(import) <- "Weights"
import <- import[order(import),1, drop=FALSE]
dotchart(import$Weights, labels=row.names(import),
xlab="% of R-Square", pch=19,
main="Relative Importance of Predictor Variables",
sub=paste("Total R-Square=", round(rsquare, digits=3)),
...)
return(import)
}
relweights(M7, col="blue")
<<<<<<< HEAD
## Weights
## neck 11.06366
## thigh 13.82063
## hip 19.73563
## abdomen 55.38008
>>>>>>> 1aef8f86c451aee3917d7dfa1dbcbb6d6defb7c4
Obviously, abdomen contributes the most in the relationship with body density.
<<<<<<< HEAD The final model is: \[ =======autoplot(M7,which=1:2)+theme_bw()
<<<<<<< HEAD
Since overweight is a binary field.. logistical regression….
data_overweight = data[-c(4:5,17)]
glm1 = glm(overweight ~ ., data = data_overweight)
# drop knee
glm2 = glm(overweight ~ density + pct_bf + age + neck + chest + abdomen + waist + hip + thigh + ankle + bicep + forearm + wrist, data = data_overweight)
# drop ankle
glm3 = glm(overweight ~ density + pct_bf + age + neck + chest + abdomen + waist + hip + thigh + bicep + forearm + wrist, data = data_overweight)
# drop density
glm4 = glm(overweight ~ pct_bf + age + neck + chest + abdomen + waist + hip + thigh + bicep + forearm + wrist, data = data_overweight)
# drop age
glm5 = glm(overweight ~ pct_bf + neck + chest + abdomen + waist + hip + thigh + bicep + forearm + wrist, data = data_overweight)
# drop waist
glm6 = glm(overweight ~ pct_bf + neck + chest + abdomen + hip + thigh + bicep + forearm + wrist, data = data_overweight)
# drop neck
glm7 = glm(overweight ~ pct_bf + chest + abdomen + hip + thigh + bicep + forearm + wrist, data = data_overweight)
# drop pct_bf
glm8 = glm(overweight ~ chest + abdomen + hip + thigh + bicep + forearm + wrist, data = data_overweight)
# drop waist
glm9 = glm(overweight ~ chest + abdomen + hip + thigh + bicep + forearm, data = data_overweight)
# drop wrist
glm10 = glm(overweight ~ chest + abdomen + hip + thigh + bicep + forearm, data = data_overweight)
# drop forearm
glm11 = glm(overweight ~ chest + abdomen + hip + thigh + bicep, data = data_overweight)
# drop thigh
glm12 = glm(overweight ~ chest + abdomen + hip + bicep, data = data_overweight)
# drop hip
glm13 = glm(overweight ~ chest + abdomen + bicep, data = data_overweight)
summary(glm13)
##
## Call:
## glm(formula = overweight ~ chest + abdomen + bicep, data = data_overweight)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.69987 -0.25440 -0.02293 0.26254 0.68701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.845275 0.282551 -13.609 < 2e-16 ***
## chest 0.013281 0.006416 2.070 0.039503 *
## abdomen 0.020164 0.004805 4.196 3.79e-05 ***
## bicep 0.035618 0.009834 3.622 0.000355 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1026899)
##
## Null deviance: 62.500 on 249 degrees of freedom
## Residual deviance: 25.262 on 246 degrees of freedom
## AIC: 146.43
##
## Number of Fisher Scoring iterations: 2
Before we start making predictions with the model, we drop the variables which are not a significant predictor for being overweight. The fitted model is shown below.
\[ logit(p) = log(\frac{p}{1-p}) = -3.845275 + 0.013281 \times Chest\\ + 0.020164 \times Abdomen\ + 0.035618 \times bicep\\ \] where the logit(p) is a special link from our linear combination of predictors to the probability of the outcome being equal to 1, and the coefficients are interpreted as changes in log-odds.
tab_model(glm13)
=======
<<<<<<< HEAD
sjPlot::tab_model(bf_step_back, bmi_step_back, M7, digits = 5, show.ci = FALSE)
=======
sjPlot::tab_model(bf_step_back, ln_bmi_step_back, digits = 5, show.ci = FALSE)
>>>>>>> 699b32c5c2d9581d6e91ed3d5569cba73d5e1b0a
>>>>>>> 1aef8f86c451aee3917d7dfa1dbcbb6d6defb7c4
| <<<<<<< HEAD | overweight ======= | pct bf <<<<<<< HEAD | bmi | density ======= >>>>>>> 1aef8f86c451aee3917d7dfa1dbcbb6d6defb7c4 | log(bmi) >>>>>>> 699b32c5c2d9581d6e91ed3d5569cba73d5e1b0a | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | p | Estimates | p | |||||||
| (Intercept) | 1.52703 | 0.818 | <<<<<<< HEAD -10.94257 | <0.001 ======= 1.79891 >>>>>>> 699b32c5c2d9581d6e91ed3d5569cba73d5e1b0a | 1.11041 | <0.001 | |||||
| neck | <<<<<<< HEAD-0.39650 | 0.076 | 0.00191 |
<0.001
=======
|
-0.39650
|
0.076
|
|
>>>>>>> 699b32c5c2d9581d6e91ed3d5569cba73d5e1b0a
|
| |||
| chest | -0.12810 | 0.156 | <<<<<<< HEAD 0.16090 ======= 0.00580 >>>>>>> 699b32c5c2d9581d6e91ed3d5569cba73d5e1b0a | <0.001 | <<<<<<< HEAD======= >>>>>>> 699b32c5c2d9581d6e91ed3d5569cba73d5e1b0a | ||||||
| abdomen | 1.01805 <<<<<<< HEAD | <0.001 | 0.12726 | <0.001 | -0.00221 | <0.001 | |||||
| hip | -0.28758 | <<<<<<< HEADCI ======= | 0.002 | 0.05047 >>>>>>> 1aef8f86c451aee3917d7dfa1dbcbb6d6defb7c4 | 0.103 ======= | <<<<<<< HEAD =======<0.001 | 0.00472 >>>>>>> 699b32c5c2d9581d6e91ed3d5569cba73d5e1b0a | 0.00113 | <0.001 | >>>>>>> 1aef8f86c451aee3917d7dfa1dbcbb6d6defb7c4||
| <<<<<<< HEAD bicep | <<<<<<< HEAD -3.85 | -4.40 – -3.29 ======= 0.26094 | 0.086 ======= hip | -0.28758 | 0.002 >>>>>>> 699b32c5c2d9581d6e91ed3d5569cba73d5e1b0a | >>>>>>> 1aef8f86c451aee3917d7dfa1dbcbb6d6defb7c4 | <<<<<<< HEAD<<<<<<< HEAD <0.001 ======= | ======= | |||
| bicep | 0.26094 | 0.086 | 0.00263 | 0.090 >>>>>>> 699b32c5c2d9581d6e91ed3d5569cba73d5e1b0a | |||||||
| wrist | -1.55084 | 0.001 | <<<<<<< HEAD =======>>>>>>> 1aef8f86c451aee3917d7dfa1dbcbb6d6defb7c4 | ||||||||
| <<<<<<< HEAD chest ======= age | >>>>>>> 699b32c5c2d9581d6e91ed3d5569cba73d5e1b0a0.00043 | 0.106 | |||||||||
| thigh | <<<<<<< HEAD 0.14983 ======= 0.00751 >>>>>>> 699b32c5c2d9581d6e91ed3d5569cba73d5e1b0a | <0.001 >>>>>>> 1aef8f86c451aee3917d7dfa1dbcbb6d6defb7c4 | |||||||||
| knee | 0.01 | 0.00 – 0.03 | <<<<<<< HEAD 0.040 ======= <<<<<<< HEAD -0.00061 | 0.035 >>>>>>> 1aef8f86c451aee3917d7dfa1dbcbb6d6defb7c4 | |||||||
| <<<<<<< HEAD abdomen ======= knee >>>>>>> 1aef8f86c451aee3917d7dfa1dbcbb6d6defb7c4 | 0.02 | 0.01 – 0.03 | <<<<<<< HEAD <0.001 | ||||||||
| bicep ======= -0.23116 | ======= -0.00693 >>>>>>> 1aef8f86c451aee3917d7dfa1dbcbb6d6defb7c4 >>>>>>> 699b32c5c2d9581d6e91ed3d5569cba73d5e1b0a0.04 | <<<<<<< HEAD 0.02 – 0.05 | <0.001 ======= | ||||||||
| forearm | 0.11484 | 0.011 | |||||||||
| forearm | 0.00426 | 0.021 >>>>>>> 1aef8f86c451aee3917d7dfa1dbcbb6d6defb7c4 | |||||||||
| Observations | <<<<<<< HEAD======= | 250 | >>>>>>> 1aef8f86c451aee3917d7dfa1dbcbb6d6defb7c4 250 | ||||||||
| R2 Nagelkerke | <<<<<<< HEAD0.626 | ||||||||||
plot_model(glm13) + theme_bw(base_size = 10) + ylim(0, 0.10) + labs(x = "Overweight", y = "Odds Ratios", title = "Model Coefficients using odds scale")
plot_model(glm13, type = "pred", terms = c("abdomen", "chest", "bicep"), show.data = TRUE) +
theme_bw(base_size = 10)
In the Coefficient graph, the three significant variables have similar odd ratios giving the model a smaller confidence interval.
From the Prediction graph, the positive slopes in the highlighted areas indicate a larger abdomen and chest circumference leads to a high probability of being overweight. Comparing the three individual graphs, the slight difference in the slope’s y axis indicates bicep circumference correlates with odds of being overweight.
We correctly classified 91.2% of the observations, hence our resubstitution error rate, proportion of data predicted incorrectly using the fitted model, is 8.8%.
glm0 = glm(overweight ~ chest + abdomen + bicep, data = data)
data = data %>%
mutate(pred_prob = predict(glm0, type = "response"),
pred_surv = round(pred_prob))
mean(data$overweight == data$pred_surv)
## [1] 0.912
# not working
#library(caret)
#confusion.glm = confusionMatrix(
# data = as.factor(data$pred_surv),
# reference = as.factor(data$overweight))
#confusion.glm$table
The odds of being overweight for someone with an above average abdomen circumference of 120 is 1.26.
predict_overweight = data.frame(abdomen = 130, chest = mean(data$chest), bicep = mean(data$bicep))
predict(glm13, newdata = predict_overweight, type = "link")
## 1
## 1.260437
The odds of being overweight for someone with a below average abdomen circumference of 60 is -0.15.
predict_overweight = data.frame(abdomen = 60, chest = mean(data$chest), bicep = mean(data$bicep))
predict(glm13, newdata = predict_overweight, type = "link")
## 1
## -0.1510207
The odds of being overweight for someone with above average circumferences for their abdomen, chest, and bicep is 1.37.
predict_overweight = data.frame(abdomen = mean(data$abdomen)*1.2, chest = mean(data$chest)*1.2, bicep = mean(data$bicep)*1.2)
predict(glm13, newdata = predict_overweight, type = "link")
## 1
## 1.369055
library(partykit)
library(rpart)
ov_tree = rpart(overweight ~ abdomen + chest + bicep, data = data_overweight, method = "class",control = rpart.control(cp = 0.009))
ov_tree
## n= 250
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 250 125 0 (0.50000000 0.50000000)
## 2) chest< 101.55 144 23 0 (0.84027778 0.15972222)
## 4) abdomen< 92.6 126 10 0 (0.92063492 0.07936508) *
## 5) abdomen>=92.6 18 5 1 (0.27777778 0.72222222) *
## 3) chest>=101.55 106 4 1 (0.03773585 0.96226415) *
plot(as.party(ov_tree))
Through looking at the three models, body measurements appears to be able to explain most variations in BMI, followed by body fat percentage and then density. This suggests that if we only have simple body measurements on hand, it is better to use BMI as a source to determine obesity.
Abdomen has also been identified as the most significant factor (relatively) for prediction across all three methods and hence in normal day-to-day life, it is important to monitor the measurement for abdomen to avoid any weight-related sickness.
Using body fat percentage as the main guidance, we have identified ___% as overweight. This model aims to look at
=======Through looking at the three models, we can see that using simply body measurements, it is easier to predict changes in bmi rather than percentage body fat. From the models, measurements of abdomen appears to have the greatest influence on both body fat and bmi and hence any increase should be treated with caution.
>>>>>>> 699b32c5c2d9581d6e91ed3d5569cba73d5e1b0adata %>%
ggplot() +
aes(x = pct_bf, y = overweight) +
geom_point(size = 10, alpha = 0.1) +
theme_classic(base_size = 10)
<<<<<<< HEAD
glm1 = glm(overweight ~ pct_bf + bmi + density, data = data)
summary(glm1)
##
## Call:
## glm(formula = overweight ~ pct_bf + bmi + density, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.08176 -0.25826 -0.03974 0.27116 0.52688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.962751 7.335444 -0.949 0.343
## pct_bf 0.012723 0.015522 0.820 0.413
## bmi 0.110788 0.008906 12.440 <2e-16 ***
## density 4.185205 6.680292 0.627 0.532
## ---
<<<<<<< HEAD
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
=======
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>>>>>> 699b32c5c2d9581d6e91ed3d5569cba73d5e1b0a
##
## (Dispersion parameter for gaussian family taken to be 0.09839388)
##
## Null deviance: 62.500 on 249 degrees of freedom
## Residual deviance: 24.205 on 246 degrees of freedom
## AIC: 135.74
##
## Number of Fisher Scoring iterations: 2
<<<<<<< HEAD